Abstract. We study the problem of computing semantic-preserving word clouds
in which semantically related words are close to each other. While several heuristic
approaches have been described in the literature, we formalize the underlying
geometric algorithm problem: Word Rectangle Adjacency Contact (WRAC). In
this model each word is a rectangle with fixed dimensions, and the goal is to represent
semantically related word pairs by contacts between their corresponding
rectangles. We design and analyze efficient polynomial-time algorithms for variants
of the WRAC problem, show that some general variants are NP-hard, and
describe several approximation algorithms. Finally, we experimentally demonstrate
that our theoretically-sound algorithms outperform the early heuristics.

1 Introduction

Word clouds and tag clouds are popular tools for visualizing text. The practical tool,
Wordle [21] took word clouds to the next level with high quality design, graphics, style
and functionality. Such word cloud visualizations provide an appealing way to summarize
the content of a webpage, a research paper, or a political speech. Often such
visualizations are used to contrast two documents; for example, word cloud visualizations
of the speeches given by the candidates in the 2008 US Presidential elections were
used to draw sharp contrast between them in the popular media.

While some of the more recent word cloud visualization tools aim to incorporate
semantics in the layout, none provide any guarantees about the quality of the layout in
terms of semantics. We propose a formal model of the problem, via a simple vertex-weighted
and edge-weighted graph. The vertices in the graph are the words in the document,
with weights corresponding to their frequency (or normalized frequency). The
edges in the graph correspond to semantic relatedness, with weights corresponding to
the strength of the relation. Each vertex must be drawn as a rectangle or box with fixed
dimensions and with area determined by its weight. The goal is to "realize" as many
edges as possible, by contacts between their corresponding rectangles; see Fig. 1.



1.1 Related Work 

The early word-cloud approaches did not explicitly use semantic information, such 
as word relatedness, in placing the words in the cloud. More recent approaches attempt 
to do so. Koh et al. [11] use interaction to add semantic relationship in their ManiWordle
approach. Parallel tag clouds by Collins et al. [2] are used to visualize evolution
over time with the help of parallel coordinates. Cui et al. [3] couple trend charts with
word clouds to keep semantic relationships, while visualizing evolution over time with
help of force-directed methods. Wu et al. [22] introduce a method for creating semantic-preserving
word clouds based on a seam-carving image processing method and an application
of bubble sets. Hierarchically clustered document collections are visualized
with self-organizing maps [12] and Voronoi treemaps [14].

Fig. 1: A hierarchical word cloud for complexity classes. A class is above another class
when the first contains the second. The font size is the square root of millions of Google
hits for the corresponding word. This is an example of the hierarchical WRAC problem.

Note that the semantic-preserving word cloud problem is related to classic graph
layout problems, where the goal is to draw graphs so that vertex labels are readable 
and Euclidean distances between pairs of vertices are proportional to the underlying 
graph distance between them. Typically, however, vertices are treated as points and 
label overlap removal is a post-processing step [5, 10]. 

In rectangle representations of graphs, vertices are axis-aligned rectangles with
non-intersecting interiors and edges correspond to pairs of rectangles with a common
boundary of non-zero length. Every graph that can be represented this way is planar and
every triangle in such a graph is a facial triangle. These two conditions are also sufficient
to guarantee a rectangle representation [1, 9, 17, 19, 20]. Rectangle representations play an
important role in VLSI layout and floor planning. Several interesting problems arise when
the rectangles in the representation are restricted. Eppstein et al. [6] consider rectangle
representations which can realize any given area-requirement or perimeter-requirement
on the rectangles. In a recent survey Felsner [7] reviews many rectangulation variants,
including squarings. Nöllenburg et al. [15] consider rectangle representations of
edge-weighted graphs, where edge weights are proportional to the lengths of the
corresponding contact.

1.2 Our Contributions 

In the formal study of the semantic word cloud problem we encounter several novel
problems. The input to all problems is a set of $n$ axis-aligned boxes $B_1, \dots, B_n$ with
fixed dimensions, e.g., box $B_i$ is encoded by $(w_i, h_i)$, where $w_i$ and $h_i$ are its width and
height. Further, for every pair $\{i, j\}$, $i \neq j$, a non-negative profit $p_{ij}$ represents the gain
for making boxes $B_i$ and $B_j$ touch. The set of non-zero profits can be seen as the edge
set of a graph whose vertices are the boxes, called the supporting graph.

We define a representation of the boxes $B_1, \dots, B_n$ to be the positions for each box
in the plane, so that no two boxes overlap. A contact between two boxes is a common
boundary. If two boxes are in contact, we say that these boxes touch. Finally, define the
total profit of a representation to be the sum of profits over all pairs of touching boxes.
Next we summarize the results in this paper.



Word Rectangle Adjacency Contact (WRAC): We are given $n$ boxes with fixed
height and width each, and for each pair of boxes $B_i \neq B_j$ a profit $p_{ij}$, which is either
0 or 1. The task is to decide whether there exists a representation of the boxes with total
profit $\sum_{i \neq j} p_{ij}$. This is equivalent to finding a representation whose induced contact
graph contains the supporting graph as a subgraph. If such a representation exists, we
say that it realizes the supporting graph and that the instance of the WRAC problem
is realizable. We show that this problem is NP-complete even if restricted to a tree as
a supporting graph. We also show that the problem can be solved in linear time if the
supporting graph is quasi-triangulated.

Hierarchical Word Rectangle Adjacency Contact (Hi-WRAC): This is a more
restricted, yet useful, version of the WRAC problem where the supporting graph is
directed, planar, with a fixed embedding, and a unique sink. The task is to find a
representation in which every contact is horizontal with the end-vertex of the corresponding
directed edge on top; see Fig. 1. We show how to solve this problem in polynomial time.

Maximum Word Rectangle Adjacency Contact (Max-WRAC): This is an optimization
problem. The task is to find a representation of the given boxes, which maximizes
the total profit. We show that the problem is weakly NP-hard if the supporting
graph is a star and present several approximation algorithms for the problem: a
constant-factor approximation for stars, trees, and planar graphs, and a
$\frac{2}{\Delta+1}$-approximation for supporting graphs of maximum degree $\Delta$. We consider
an extremal version of the Max-WRAC problem and show that if the supporting graph
$G = K_n$ ($n \geq 5$) and each profit is 1, then there always exists a representation with total
profit $2n - 2$ and that this is sometimes best possible. Such a representation can be found
in linear time.

Minimum Area Word Rectangle Adjacency Contact (Area-WRAC): Given 
an instance of the WRAC problem, which is already known to be realizable, find a 
representation that realizes the supporting graph and minimizes the area of the bounding 
box containing all boxes. We show that this problem is NP-hard even if restricted to even 
simpler graphs as supporting graphs, namely independent sets, paths, or cycles. 
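
The definitions of contact and total profit from this section can be made concrete with a small sketch. It assumes that each placed box is given as a tuple (x, y, w, h) with (x, y) its bottom-left corner, that profits are stored in a dictionary keyed by index pairs (i, j) with i < j, and that a contact means a shared boundary segment of positive length (so two boxes meeting only in a corner do not touch). The function names are illustrative and not taken from the paper.

```python
from itertools import combinations

def _overlap(lo1, hi1, lo2, hi2):
    """Signed length of the intersection of [lo1, hi1] and [lo2, hi2]."""
    return min(hi1, hi2) - max(lo1, lo2)

def _extents(a, b):
    """Overlap lengths of boxes a and b in x-direction and in y-direction."""
    dx = _overlap(a[0], a[0] + a[2], b[0], b[0] + b[2])
    dy = _overlap(a[1], a[1] + a[3], b[1], b[1] + b[3])
    return dx, dy

def interiors_overlap(a, b):
    """True if the interiors of the two boxes intersect."""
    dx, dy = _extents(a, b)
    return dx > 0 and dy > 0

def touch(a, b):
    """True if the boxes share a boundary segment of positive length."""
    dx, dy = _extents(a, b)
    return (dx == 0 and dy > 0) or (dy == 0 and dx > 0)

def total_profit(placed, profit):
    """Sum of profits over all touching pairs of a valid representation.

    placed: list of boxes (x, y, w, h); profit: dict {(i, j): p_ij}, i < j.
    Raises ValueError if two boxes overlap, i.e. if this is no representation.
    """
    total = 0
    for i, j in combinations(range(len(placed)), 2):
        if interiors_overlap(placed[i], placed[j]):
            raise ValueError(f"boxes {i} and {j} overlap")
        if touch(placed[i], placed[j]):
            total += profit.get((i, j), 0)
    return total
```

With 0/1 profits, an instance of WRAC is realized by a placement exactly when `total_profit` equals the sum of all profits.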

2 The WRAC problem 

Theorem 1. WRAC is NP-complete even if the supporting graph is a tree.

Proof. It is easy to verify a solution of the WRAC problem in polynomial time, so
the problem is in NP. To show that the problem is NP-hard we use a reduction from
3-Partition, which is defined as follows. Given a multiset $S = \{s_1, s_2, \dots, s_n\}$
of $n = 3m$ integers with $\sum_{i=1}^{n} s_i = mB$, is there a partition of $S$ into $m$ subsets
$S_1, \dots, S_m$ such that in each subset the numbers sum up to exactly $B$? This classical
problem is known to be NP-complete even if for every $i$ we have $B/4 < s_i < B/2$,
in which case every subset $S_j$ must contain exactly three elements. We also assume
w.l.o.g. that $B > (m - 1)/2$, which can be achieved by scaling all $s_i$ appropriately.

Given an instance $S = \{s_1, s_2, \dots, s_n\}$ of 3-Partition, $n = 3m$, $\sum_{i=1}^{n} s_i = mB$,
we define a tree $T_S$ on $2n + 4$ vertices as follows. There is a vertex $v_i$ for
$i = 1, \dots, n$, a vertex $w_j$ for $j = 1, \dots, m$, a vertex $u_j$ for $j = 1, \dots, m - 1$, a vertex $x_j$
for $j = 1, \dots, m - 1$, a vertex $c$, and five vertices $a_1, a_2, a_3, a_4, a_5$. Vertex $c$ is adjacent
to all vertices except for $w_1, \dots, w_m$ and $x_1, \dots, x_{m-1}$. For $j = 1, \dots, m - 1$ vertex
$u_j$ is adjacent to $w_j$ and $x_j$, and finally $u_{m-1}$ is adjacent to $w_m$; see Fig. 2.

Fig. 2: The tree $T_S$, created from a set $S$ of $n$ integers and a representation realizing $T_S$.

For each vertex we define a box by specifying its height and width. For simplicity
let us write $v \to (h, w)$ to say that the box for $v$ has height $h$ and width $w$. Using
this notation we define $u_j \to (B + j, 1)$ for $j = 1, \dots, m - 1$, $w_j \to (B, B)$ for
$j = 1, \dots, m$, $x_j \to (1, jB + B + j)$ for $j = 1, \dots, m - 1$, $v_i \to (1, s_i)$ for $i = 1, \dots, n$,
$c \to (1, mB + m - 1)$, and $a_k \to (mB + m - 1, mB + m - 1)$ for $k = 1, 2, 3, 4, 5$.

We claim that an instance $S$ of 3-Partition is feasible if and only if the instance of
WRAC defined above is feasible. To this end, consider any representation that realizes
$T_S$. We refer to Fig. 2 for an illustration. We abuse notation and refer to the box for a
vertex $v$ also as $v$. The box $c$ has height 1 and width $mB + m - 1$. Since $c$ touches
the five $(mB + m - 1) \times (mB + m - 1)$ squares $a_1, a_2, a_3, a_4$ and $a_5$, each $a_k$ contains
a corner of $c$. It follows that at least three sides of $c$ are partially covered by some $a_k$
and at least one horizontal side of $c$ is completely covered by some $a_k$. Because $c$ has
height 1 only, but touches the boxes $v_1, \dots, v_n, u_1, \dots, u_{m-1}$ (each of height at least
1), all these boxes touch $c$ on its free horizontal side, say the bottom. Indeed the widths
of $v_1, \dots, v_n, u_1, \dots, u_{m-1}$ sum exactly to the width of $c$.

Now $u_{m-1}$ touches $x_{m-1}$ whose width is also $mB + m - 1$. Since $u_{m-1}$ has height
$B + m - 1 < mB + m - 1$, the top of $x_{m-1}$ touches the bottom of $u_{m-1}$ and the left
and right of $x_{m-1}$ touch some $a_k$ each. Since $u_{m-1}$ also touches the $B \times B$ squares
$w_{m-1}$ and $w_m$ and $u_{m-1}$ has height $B + m - 1 < 2B$, there is one square on each side
of $u_{m-1}$. Then $u_{m-1}$ and the rightmost $a_k$ are at horizontal distance of at least $B$.

The height of $u_{m-2}$ is by one less than the height of $u_{m-1}$. Moreover, $u_{m-2}$ touches
$x_{m-2}$ whose width is by $B + 1$ less than the width of $x_{m-1}$. This forces $x_{m-2}$ to touch
some $a_k$ on the left, $u_{m-1}$ on the right and $u_{m-2}$ on top. Moreover, $u_{m-2}$ has $w_{m-2}$
on its left side. It follows that $u_{m-2}$ and $u_{m-1}$ have a horizontal distance of at least $B$.

Similarly, for all $i = m - 1, \dots, 2$ the boxes $u_i$ and $u_{i-1}$, as well as the box $u_1$
and the leftmost box $a_k$, have a horizontal distance of at least $B$. Now the width of $c$
being $mB + m - 1$ forces all these distances to be exactly $B$. Thus the boxes $v_1, \dots, v_n$
are partitioned into $m$ subsets corresponding to the $m$ spaces between the leftmost $a_k$,
all the $u_j$, and the rightmost $a_k$. Since $v_i$ has width $s_i$ for $i = 1, \dots, n$, in each subset the
numbers sum up to exactly $B$.

Along the same lines one can easily construct a representation realizing $T_S$ based
on any given solution of the 3-Partition instance $S$. This concludes the proof. □
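
The construction used in this reduction can be sketched in a few lines of code, which may help in checking the vertex count and the width bookkeeping. The sketch below is an illustration only: the function name and the string labels are hypothetical, boxes are stored as (height, width) to mirror the proof's notation, and the width of $x_j$ is taken to be $(j+1)B + j$, which is what the proof's statements about $x_{m-1}$ (width $mB + m - 1$) and $x_{m-2}$ (narrower by $B + 1$) imply.

```python
def wrac_from_3partition(s, B):
    """Build the boxes and tree edges of the reduction from 3-Partition.

    s: list of n = 3m integers summing to m * B (with m >= 2).
    Returns (boxes, edges): boxes maps a label to (height, width),
    edges is a list of label pairs forming the supporting tree.
    """
    n = len(s)
    m = n // 3
    if n != 3 * m or m < 2 or sum(s) != m * B:
        raise ValueError("expected 3m integers (m >= 2) summing to m * B")
    side = m * B + m - 1                      # width of c, side of each a_k

    boxes = {"c": (1, side)}
    edges = []
    for i in range(1, n + 1):                 # item boxes v_i, all touching c
        boxes[f"v{i}"] = (1, s[i - 1])
        edges.append(("c", f"v{i}"))
    for k in range(1, 6):                     # five big squares around c
        boxes[f"a{k}"] = (side, side)
        edges.append(("c", f"a{k}"))
    for j in range(1, m + 1):                 # B x B squares w_j
        boxes[f"w{j}"] = (B, B)
    for j in range(1, m):                     # separators u_j and spacers x_j
        boxes[f"u{j}"] = (B + j, 1)
        boxes[f"x{j}"] = (1, (j + 1) * B + j)  # inferred width, see lead-in
        edges += [("c", f"u{j}"), (f"u{j}", f"w{j}"), (f"u{j}", f"x{j}")]
    edges.append((f"u{m - 1}", f"w{m}"))
    return boxes, edges
```

For an instance with $n = 6$ and $B = 15$, for example, this yields $2n + 4 = 16$ boxes and 15 tree edges, and the widths of the $v_i$ and $u_j$ add up to the width of $c$.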
Fig. 3: Left: starting configuration with rays $v_S$ and $v_W$. Center: representation at an
intermediate step: vertex $w$ fits into concavity $p$ and results in a staircase, vertex $v$ fits
into concavity $s$ but does not result in a staircase. Adding box $w$ to the representation
introduces new concavity $q$, and the vertex at concavity $r$ may be applicable. Right:
there is no applicable vertex and the algorithm terminates.

By Theorem 1 the WRAC problem is NP-hard if the supporting graph is a tree, and
thus it is NP-hard in general. However, there are classes of supporting graphs for which 
the problem can be solved efficiently. 

A rectangle representation is called a rectangular dual if the union of all rectangles
is again a rectangle whose boundary is formed by exactly four rectangles. A graph
$G$ admits a rectangular dual if and only if $G$ is planar, internally triangulated, has a
quadrangular outer face and does not contain separating triangles [1]. Call such graphs
quasi-triangulated. The four outer vertices of a quasi-triangulated graph are denoted by
$v_N, v_E, v_S, v_W$ in clockwise order around the outer quadrangle. A quasi-triangulated
graph $G$ may have exponentially many rectangular duals. However, every rectangular
dual of $G$ can be built up by placing one rectangle at a time, always keeping the union
of placed rectangles in a staircase shape.

Theorem 2. WRAC can be solved in linear time for quasi-triangulated supporting graphs.

Proof (Sketch). The algorithm greedily builds up a representation of the quasi-triangulated
supporting graph $G$. Start with a vertical and a horizontal ray emerging from the same point
$p$, as placeholders for the right side of $v_W$ and the top side of $v_S$, respectively. Then at
each step consider a concavity, that is, a point on the boundary of the so far constructed
representation which is a bottom-right or top-left corner of some rectangle, with $p$ as the
initial concavity. Since each concavity $p$ is contained in exactly two rectangles, there exists
a unique rectangle $R_p$ that is yet to be placed and has to touch both these rectangles.
If by adding $R_p$ we still have a staircase shape representation, then we do so. If no
such rectangle can be added, we conclude that $G$ is not realizable. See Fig. 3 for an
illustration; the complete proof is in the Appendix. □

3 The Hi-WRAC problem 

The Hi-WRAC problem is a more restricted variant of the WRAC problem, but it can
be used in practice to produce word clouds with a hierarchical structure; see Fig. 1. In
this setting the input is a plane embedded graph $G$ with an acyclic orientation of its
edges such that only one vertex has no outgoing edges, called a sink. The task is to find
a representation that hierarchically realizes $G$, that is, it induces $G$ with its embedding
as a contact graph and for every directed edge $v \to w$ in $G$ the box for $v$ touches the
box for $w$ with its top side. In particular, every contact is horizontal and going along
directed edges in the graph corresponds to "going up" in the representation.

If the embedding of $G$ is not fixed, it is easy to adapt the proof of Theorem 1
to show that the problem is again NP-complete, already for trees. Indeed, one simply
has to remove the vertices $a_k$, $k = 1, 2, 3, 4, 5$, and orient the remaining edges of $T_S$
according to the representation shown in Fig. 2. However, if we fix the embedding of
the supporting graph $G$ and there is exactly one sink, then the Hi-WRAC problem is
polynomial-time solvable.

Theorem 3. The Hi-WRAC problem can be solved in polynomial time.

Proof. Let $G$ be the given supporting graph, i.e., a directed embedded planar graph with
vertex set of boxes $\mathcal{B} = \{B_1, \dots, B_n\}$. Let $h_i$ and $w_i$ be the height and width of box
$B_i$, $i = 1, \dots, n$, and $B_1$ be the unique sink. Our algorithm consists of three phases.

Phase 1: Here we check whether the orientation and embedding of $G$ are compatible
with each other. Indeed the orientation of $G$ must be acyclic, and going clockwise
around every vertex the incident edges must come as a (possibly empty) set of incoming
edges followed by a (possibly empty) set of outgoing edges. If one of the two properties
fails, then $G$ cannot be hierarchically realized and the algorithm stops.

Phase 2: Here we check whether the given heights of boxes are compatible with the
orientation of $G$. More precisely, we set for each box $B_i$ two numbers $\mathrm{low}_i$ and $\mathrm{high}_i$,
which correspond to the y-coordinate of the bottom and top side of $B_i$, respectively.
In particular, we set $\mathrm{low}_1 = 0$, for every $i = 1, \dots, n$ we set $\mathrm{high}_i = \mathrm{low}_i + h_i$, and
for every edge $B_i \to B_j$ we set $\mathrm{high}_i = \mathrm{low}_j$. This can be done with one iteration of
breadth-first search of $G$. If one number would have to be set to two different values,
then $G$ cannot be hierarchically realized and the algorithm stops.
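
Phase 2 can be sketched as a breadth-first search that starts at the sink and propagates y-coordinates across edges in both directions. The sketch assumes that boxes are identified by hashable labels, that an edge (i, j) stands for $B_i \to B_j$ with $B_j$ resting on top of $B_i$ (matching "going up" along directed edges), and that the function name `assign_y` is purely illustrative.

```python
from collections import deque

def assign_y(heights, edges, sink):
    """Compute (low, high) for every box, or None if the heights conflict.

    heights: dict label -> height; edges: list of (i, j) meaning B_i -> B_j,
    so the top side of B_i coincides with the bottom side of B_j.
    """
    neighbours = {b: [] for b in heights}
    for i, j in edges:
        neighbours[i].append((j, True))    # j sits on top of i
        neighbours[j].append((i, False))   # i hangs below j

    low = {sink: 0}
    queue = deque([sink])
    while queue:
        b = queue.popleft()
        for other, other_is_above in neighbours[b]:
            if other_is_above:
                value = low[b] + heights[b]        # low_other = high_b
            else:
                value = low[b] - heights[other]    # high_other = low_b
            if other not in low:
                low[other] = value
                queue.append(other)
            elif low[other] != value:
                return None                        # two different values forced
    if len(low) != len(heights):
        raise ValueError("some box is not connected to the sink")
    return {b: (low[b], low[b] + heights[b]) for b in heights}
```

Every edge is inspected from both of its endpoints, so any pair of incompatible requirements is detected no matter in which order the search reaches the boxes.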

Phase 3: Here we check whether the given widths of boxes are compatible with the
orientation and embedding of $G$ and compute a representation hierarchically realizing
$G$, if it exists. Since we already know the y-coordinates for each box it suffices to
compute a valid assignment of x-coordinates. To avoid overlaps, any two boxes whose
y-coordinates intersect interiorly must have interiorly disjoint x-coordinates. Since $G$
has a unique sink we can determine which of the two boxes lies to the left and which to
the right: consider for every box $B_i$ the leftmost and rightmost directed path from $B_i$
to $B_1$ and say that $B_i$ lies to the left of $B_j$ if the leftmost path of $B_i$ joins the leftmost
path of $B_j$ from the left. Similarly, $B_i$ lies to the right of $B_j$ if the rightmost path of $B_i$
joins the rightmost path of $B_j$ from the right. Note that if $B_i$ lies to the left of $B_j$ then
$B_j$ does not lie to the left of $B_i$, but $B_i$ may also lie to the right of $B_j$. More precisely,
we introduce for each box $B_i$ two variables $\mathrm{left}_i$ and $\mathrm{right}_i$, which correspond to the
x-coordinate of the left and right side of $B_i$, respectively. We consider the equations

$\mathrm{right}_i = \mathrm{left}_i + w_i$ for $i = 1, \dots, n$   (1)

which ensure that each box $B_i$ has width $w_i$. When the y-coordinates of $B_i$ and $B_j$
intersect interiorly, i.e., if $\max\{\mathrm{low}_i, \mathrm{low}_j\} < \min\{\mathrm{high}_i, \mathrm{high}_j\}$, we have inequalities

$\mathrm{right}_i \leq \mathrm{left}_j$ for $B_i$ to the left of $B_j$, and   (2)

$\mathrm{left}_i \geq \mathrm{right}_j$ for $B_i$ to the right of $B_j$   (3)

which ensure that $B_i$ and $B_j$ do not intersect interiorly. Finally, for every directed edge
$B_i \to B_j$ we consider the inequalities

$\mathrm{right}_i > \mathrm{left}_j$ and   (4)

$\mathrm{left}_i < \mathrm{right}_j$,   (5)

which ensure that boxes $B_i$ and $B_j$ touch. It is easy to verify that the solutions of
the system of linear equations (1) and inequalities (2)-(5) on variables $\mathrm{left}_i$ and $\mathrm{right}_i$
correspond to representations hierarchically realizing $G$. Thus if a solution is found, the
algorithm defines a representation by placing box $B_i$ with its bottom-left corner onto the
point $(\mathrm{left}_i, \mathrm{low}_i)$, $i = 1, \dots, n$. If no solution exists, then $G$ cannot be hierarchically
realized and the algorithm stops.

The first two phases can be easily carried out in linear time. In the third phase,
finding all leftmost and rightmost paths and deciding for every pair $B_i$, $B_j$ whether $B_i$
lies left or right of $B_j$, can also be done in linear time. Setting up the equations and
inequalities takes at most quadratic time since there are $O(n^2)$ inequalities. The rest
boils down to linear programming, and hence can be done in polynomial time. (A feasible
solution can be found faster than with LP, but we leave the details out of this paper.) □
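
To make the role of constraints (1)-(5) concrete, the following sketch checks a candidate assignment of x-coordinates against them. It is a verifier, not the LP solver from the proof, and it makes several assumptions: the relations "left of" and "right of" are supplied as precomputed sets of ordered pairs, (2) and (3) are read as non-strict and (4) and (5) as strict inequalities, and all names are illustrative.

```python
def satisfies_constraints(left, width, ys, left_of, right_of, edges):
    """Check a candidate x-assignment against constraints (1)-(5).

    left:  dict label -> x-coordinate of the left side
    width: dict label -> width w_i
    ys:    dict label -> (low, high), e.g. as computed in Phase 2
    left_of / right_of: sets of pairs (i, j), "B_i lies left/right of B_j"
    edges: list of (i, j) meaning B_i -> B_j (B_j on top of B_i)
    """
    right = {b: left[b] + width[b] for b in left}             # (1)

    def y_overlap(i, j):
        return max(ys[i][0], ys[j][0]) < min(ys[i][1], ys[j][1])

    for i, j in left_of:
        if y_overlap(i, j) and not right[i] <= left[j]:       # (2)
            return False
    for i, j in right_of:
        if y_overlap(i, j) and not left[i] >= right[j]:       # (3)
            return False
    for i, j in edges:
        if not (right[i] > left[j] and left[i] < right[j]):   # (4), (5)
            return False
    return True
```

In the small example used for testing, two boxes of width 2 hang side by side below a box of width 4; shifting the right one so that it overlaps its neighbour violates (2), and shifting it past the top box violates (5).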

4 The Max-WRAC problem 

We begin by showing that Max-WRAC is NP-hard, even for simple supporting graphs. 
Since this version of the problem is particularly relevant in practice, we also present 
approximation algorithms for several different classes of supporting graphs. 

4.1 NP-hardness 

Theorem 4. Max-WRAC is (weakly) NP-hard if the supporting graph is a star.

Proof (Sketch). We use a reduction from the well-known Knapsack problem, where
the task is to decide if there exists a subset $S$ of $n$ given items, each with weight $w_i > 0$
and a profit $p_i > 0$, that fits into a knapsack with capacity $C$, i.e., $\sum_{i \in S} w_i \leq C$, and
yields a total profit of at least $P$, i.e., $\sum_{i \in S} p_i \geq P$.

The reduction is similar to the one presented in the proof of Theorem 1. We define
an edge-weighted star with a vertex $v_i$ for each item, a vertex $c$ which is the center of
the star, and five vertices $a_1, a_2, a_3, a_4, a_5$ that block all but one side of $c$. The rectangle
for each $v_i$ has width $w_i$, height 1 and the profit for its edge with $c$ is $p_i$. The rectangle
for $c$ has width $C$ and height 1. For $k = 1, 2, 3, 4, 5$ the rectangle for $a_k$ is a $C \times C$ square
and the profit of the edge $a_k c$ is $\sum_i p_i$, which ensures that in every optimal solution these
edges are realized.

It is now straightforward to check that a subset $S$ of items can be packed into the
knapsack if and only if the vertices for $S$ plus $a_1, \dots, a_5$ can touch $c$. Details are
provided in the Appendix. □

4.2 Approximation Algorithms 

In this section we present approximation algorithms for the Max-WRAC problem,
for certain classes of supporting graphs. As a common tool for our algorithms we use the
Maximum Generalized Assignment Problem (GAP) defined as follows: Given
a set of bins with capacity constraint and a set of items that have a possibly different
size and value for each bin, pack a maximum-valued subset of items into the bins.
It is known that the problem is NP-complete (Knapsack as well as Bin Packing
are special cases of GAP), and there is a polynomial-time $(1 - 1/e)$-approximation
algorithm [8]. In the remainder we assume that there is an $\alpha$-approximation algorithm
for the GAP problem, setting $\alpha = 1 - 1/e$.

Fig. 4: An optimal representation for the Max-WRAC problem whose supporting
graph is a star with center $B_0$. The striped boxes on the right are those from the trash
bin.

Theorem 5. There exists a polynomial-time $\alpha$-approximation algorithm for the
Max-WRAC problem if the supporting graph is a star.

Proof. Let $B_0$ denote the box corresponding to the center of the star. In any optimal
solution for the Max-WRAC problem there are four boxes $B_1, B_2, B_3, B_4$ whose sides
contain one corner of $B_0$ each. Given $B_1, B_2, B_3, B_4$, the problem reduces to assigning
each remaining box $B_i$ to at most one of the four sides of $B_0$ which completely contains
the contact between $B_i$ and $B_0$; see Fig. 4.

This can be formulated as the GAP problem. The four sides are the bins plus a
trash bin for all boxes not touching $B_0$, the size of an item is its width for the horizontal
bins and its height for the vertical bins (the size for the trash bin is irrelevant), the
value of an item is its profit of the adjacency to the central box except for the trash bin
where all items have value 0. We can now apply the algorithm for the GAP problem,
which will result in the $\alpha$-approximation for the set of boxes. To get an approximation
for the Max-WRAC problem we consider all possible variants of choosing boxes
$B_1, B_2, B_3, B_4$, which increases the runtime only by a polynomial factor. □
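
The GAP formulation in this proof can be illustrated on tiny instances. The sketch below builds the item sizes and values for the four side bins and then solves the resulting GAP instance by exhaustive search, which stands in for the $(1 - 1/e)$-approximation algorithm of [8] and is only feasible for a handful of boxes. The bin capacities are passed in as a parameter because the text does not spell them out; treating them as given numbers is an assumption of this sketch, as are all function names.

```python
from itertools import product

SIDES = ("top", "bottom", "left", "right")

def star_to_gap(boxes, profits):
    """Item sizes and values per bin for the boxes around the star center.

    boxes[i] = (width, height) of the i-th non-central box,
    profits[i] = profit of its edge to the center.
    """
    size = {s: [w if s in ("top", "bottom") else h for w, h in boxes]
            for s in SIDES}
    value = {s: list(profits) for s in SIDES}
    return size, value

def solve_gap_bruteforce(size, value, capacity):
    """Try every assignment of items to a side or to the trash bin (None)."""
    n = len(size[SIDES[0]])
    best_value, best_assignment = 0, (None,) * n
    for assignment in product((None,) + SIDES, repeat=n):
        load = dict.fromkeys(SIDES, 0)
        total = 0
        for item, side in enumerate(assignment):
            if side is not None:               # trash bin: size and value 0
                load[side] += size[side][item]
                total += value[side][item]
        if total > best_value and all(load[s] <= capacity[s] for s in SIDES):
            best_value, best_assignment = total, assignment
    return best_value, best_assignment
```

For four 2 x 2 boxes with profits 5, 4, 3, 2 and hypothetical capacities 2 for the horizontal sides and 1 for the vertical sides, only two boxes fit, and the search keeps the two most profitable ones.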

A star forest is a disjoint union of stars. A partition of a graph $G$ into $k$ star forests
is a partitioning of the edges of $G$ into $k$ sets, each being a star forest.

Theorem 6. If the supporting graph can be partitioned in polynomial time into $k$ star
forests, then there exists a polynomial-time $\alpha/k$-approximation algorithm for the
Max-WRAC problem.

Proof. Consider any representation with maximum total profit, that is, an optimal
solution to the Max-WRAC problem. Let $E^* \subseteq E$ be the subset of edges that are realized
as contacts in this representation, and let $W_{\mathrm{opt}}$ be the total profit of this representation.
Partition the supporting graph into $k$ star forests. Since the edges of the supporting
graph contain all the edges $E^*$, we find a star forest $F$ whose edges in $E^*$ carry total
profit at least $W_{\mathrm{opt}}/k$. Applying Theorem 5 to each star in $F$ and putting the resulting
representations disjointly next to each other, gives the desired representation. □

Fig. 5: Left: Realizing cycle $(v_1,$

Corollary 1. There is an approximation algorithm for the Max-WRAC problem with

- approximation factor $\alpha/2$ if the supporting graph is a tree,

- approximation factor $\alpha/6$ if the supporting graph is planar.

Proof. It is easy to partition any tree into two star forests in linear time. Moreover, every
planar graph can be partitioned into three trees in linear time, for example by finding a
Schnyder wood [18]. Then the three trees can be partitioned into six star forests. The
results now follow directly from Theorem 6. □
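
The first step of this proof, splitting a tree into two star forests, is not spelled out in the text. One standard way to do it, shown here as a sketch rather than as the authors' procedure, is to root the tree and to put each edge into one of two classes according to the parity of the depth of its upper endpoint. Each class then consists of stars centered at the vertices of one depth parity. Vertices are assumed to be numbered 0, ..., n-1, and the function names are illustrative.

```python
from collections import deque

def two_star_forests(n, edges, root=0):
    """Split the edges of a tree on vertices 0..n-1 into two star forests."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    depth = {root: 0}
    forests = ([], [])
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for child in adj[parent]:
            if child not in depth:
                depth[child] = depth[parent] + 1
                # edges leaving an even-depth vertex go to forest 0,
                # edges leaving an odd-depth vertex go to forest 1
                forests[depth[parent] % 2].append((parent, child))
                queue.append(child)
    return forests

def is_star_forest(edges):
    """A graph is a star forest iff every edge has an endpoint of degree 1."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return all(degree[u] == 1 or degree[v] == 1 for u, v in edges)
```

The traversal visits every edge once, so the split takes linear time, in line with the claim in the proof.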

Our method of partitioning the supporting graph into star forests and choosing the
best, is likely not optimal. Nguyen et al. [13] show how to find a star forest carrying
at least half of the profits of an optimal star forest in polynomial time. However, we
cannot guarantee that the approximation of the optimal star forest carries a positive
fraction of the total profit in an optimal solution of the Max-WRAC problem. Hence,
approximating the Max-WRAC problem for general graphs remains an open problem.
As a step in this direction, we present a constant-factor approximation for supporting
graphs with bounded maximum degree. First we need the following lemma.

Lemma 1. For every set of $n \geq 3$ boxes we can find a representation realizing any
given $n$-cycle in linear time.

Proof. Let $C = (v_1, v_2, \dots, v_n)$ be any given cycle. We first make boxes $v_1$ and $v_n$
adjacent horizontally; see Fig. 5. We proceed in steps, adding one or two boxes in each
step. At each step we consider the rightmost horizontal contact. Let $p$ be the rightmost
point in the contact $v_i \cap v_j$. We maintain that if $v_i$ is the box on top and $v_j$ is the box
below, then $i < j$ and we have placed precisely the boxes $v_k$ with $k \leq i$ or $k \geq j$.

Now consider the box with rightmost right side. If it is $v_i$ we place the box $v_{j-1}$
with its top-left corner onto $p$. If it is $v_j$ we place the box $v_{i+1}$ with its bottom-left
corner onto $p$. If the right sides of $v_i$ and $v_j$ are collinear and $j - i > 2$ we place $v_{j-1}$
with its top-left corner slightly above $p$ and $v_{i+1}$ with its bottom-left corner onto the
top-left corner of $v_{j-1}$. If $j - i = 2$, that is, $v_{i+1} = v_{j-1}$ is the last box, we place it
with its top-left corner slightly above $p$.

In either case, after each step the current representation realizes a cycle of the form
$(v_1, \dots, v_i, v_j, \dots, v_n)$ for some $i < j$. In the example in Fig. 5 the boxes were added
as follows: $\{v_1, v_{10}\}$, $\{v_9\}$, $\{v_2\}$, $\{v_3\}$, $\{v_4, v_8\}$, $\{v_5\}$, $\{v_7\}$, $\{v_6\}$. □
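
The procedure in this proof can be turned into a short program. The following sketch follows the case distinction above under a few assumptions of its own: box sizes are positive integers or fractions, "slightly above" is realized by an offset of half the smallest box height, exact rational arithmetic is used so that the collinearity test is reliable, and the helper names are illustrative. Boxes are returned as (x, y, w, h) with (x, y) the bottom-left corner.

```python
from fractions import Fraction

def realize_cycle(sizes):
    """Place boxes v_1..v_n (sizes[k] = (w, h)) so that consecutive boxes touch."""
    n = len(sizes)
    if n < 3:
        raise ValueError("need at least three boxes")
    sizes = [(Fraction(w), Fraction(h)) for w, h in sizes]
    eps = min(h for _, h in sizes) / 2      # the "slightly above" offset
    pos = [None] * n                        # bottom-left corners
    i, j = 0, n - 1                         # v_i on top, v_j below (0-based)
    y = Fraction(0)                         # y-coordinate of the current contact
    pos[j] = (Fraction(0), y - sizes[j][1])
    pos[i] = (Fraction(0), y)

    def right(k):
        return pos[k][0] + sizes[k][0]

    while j - i > 1:
        ri, rj = right(i), right(j)
        px = min(ri, rj)                    # x-coordinate of the point p
        if ri > rj:                         # top box sticks out: hang v_{j-1}
            j -= 1
            pos[j] = (px, y - sizes[j][1])
        elif rj > ri:                       # bottom box sticks out: stack v_{i+1}
            i += 1
            pos[i] = (px, y)
        else:                               # right sides are collinear
            y += eps
            j -= 1
            pos[j] = (px, y - sizes[j][1])  # top-left corner slightly above p
            if j - i > 1:                   # not the last box: add v_{i+1} too
                i += 1
                pos[i] = (px, y)
    return [(x, yy, w, h) for (x, yy), (w, h) in zip(pos, sizes)]

def _extents(a, b):
    dx = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    dy = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return dx, dy

def touch(a, b):
    """True if the boxes share a boundary segment of positive length."""
    dx, dy = _extents(a, b)
    return (dx == 0 and dy > 0) or (dy == 0 and dx > 0)

def interiors_overlap(a, b):
    dx, dy = _extents(a, b)
    return dx > 0 and dy > 0
```

Each box is placed once and only the two boxes of the current contact are inspected per step, which matches the linear running time claimed in Lemma 1.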



Similar to Theorem 6, from Lemma 1 we can obtain an approximation algorithm 
for the Max-WRAC problem, in case the supporting graph can be covered by few sets 
of disjoint cycles. 

Theorem 7. If one can find in polynomial time $k$ sets of disjoint cycles that together
cover the edges of the supporting graph, then one can find in polynomial time a
representation with total profit at least $\frac{1}{k} \sum_{i \neq j} p_{ij}$. In particular, this is a polynomial-time
$1/k$-approximation algorithm for the Max-WRAC problem.

Corollary 2. There is a polynomial-time $\frac{2}{\Delta+1}$-approximation algorithm for the
Max-WRAC problem if the supporting graph has maximum degree $\Delta$.

Proof. As Petersen shows [16], the edges of any graph of maximum degree $\Delta$ can be
covered by $\lceil \Delta/2 \rceil$ sets of cycles, and such sets can be found in polynomial time. The
result now follows from Theorem 7. □

4.3 An Extremal Max-WRAC Problem 

Consider a set $\mathcal{B} = \{B_1, \dots, B_n\}$ of $n$ boxes with fixed dimensions, the complete
graph, $G = K_n$, as supporting graph, and all profits worth 1 unit. Denote by $f(\mathcal{B})$ the
maximum number of adjacencies that can be realized among the $n$ boxes in $\mathcal{B}$. Further
we define $f(n) = \min\{f(\mathcal{B}) : |\mathcal{B}| = n\}$.

Theorem 8. For $n = 2, 3, 4$ we have $f(n) = 2n - 3$ and for every $n \geq 5$ we have
$f(n) = 2n - 2$.

Proof. It is easy to verify the lower bound for the base cases $f(n) \geq 2n - 3$ for
$n = 2, 3, 4$. So let $n \geq 5$ and fix $\mathcal{B} = \{B_1, \dots, B_n\}$ to be any set of $n$ boxes. We have
to show that $f(\mathcal{B}) \geq 2n - 2$, i.e., that we can position the boxes so that $2n - 2$ pairs
of boxes touch. We start by selecting five arbitrary boxes $B_1, B_2, B_3, B_4, B_5$. Without
loss of generality, let $B_1$ and $B_2$ be the boxes with largest height, and $B_3$ and $B_4$ be the
boxes with largest width among $\{B_3, B_4, B_5\}$. We place the five boxes as in Fig. 5. The
remaining $n - 5$ boxes are added to the picture in any order in such a way that every
box realizes two adjacencies at the time it is placed. To this end it is enough to apply
the procedure described in Lemma 1 taking _B2, B^ as the first two boxes. 

Next consider the upper bounds. We have $f(n) \leq 2n - 3$ for $n = 2, 3$ simply
because a pair of boxes can touch only once. We have $f(4) \leq 5$ because contact graphs
of boxes are planar graphs in which every triangle is an inner face, which rules out $K_4$.
So let $n \geq 5$. We show that $f(n) \leq 2n - 2$, by constructing a set of $n$ boxes for which,
in any arrangement of the boxes, at most $2n - 2$ pairs of boxes touch. For $i = 1, \dots, n$
we define $B_i$ to be a square box of side length $2^i$. Consider any placement of the boxes
$B_1, \dots, B_n$. We partition the contacts into horizontal contacts and vertical contacts,
depending on whether the two boxes touch with horizontal sides or vertical sides. From
the side length of boxes, it now follows that neither set of contacts contains a cycle, i.e.,
each consists of at most $n - 1$ contacts. This gives at most $2n - 2$ contacts in total. □

5 The Area-WRAC problem 

Not all contact representations realizing the same adjacencies are equally practically
useful (or visually appealing) when viewed as word clouds. Here we consider the
Area-WRAC problem and show that finding a "compact" representation, fitting into
a small bounding box, is another hard problem. In particular, we are given a supporting
graph $G$, which is known to be realizable and the goal is to find a representation that
still realizes $G$ and additionally fits into a small bounding box.

The reductions are from the (strongly) NP-hard 2D Strip Packing problem, defined
as follows. We are given a set $R = \{r_1, r_2, \dots, r_n\}$ of $n$ rectangles with width
and height functions $w : R \to \mathbb{N}$, $h : R \to \mathbb{N}$. All the widths and heights are integers
bounded by some polynomial in $n$. We are also given a strip of width $W$ and infinite
height and a positive integer $H$, also bounded by a polynomial in $n$. The task is to pack
the given rectangles into the strip such that the total height is at most $H$.

The Strip Packing problem is actually equivalent to the Area-WRAC problem 
when the supporting graph is an n-vertex independent set, because it boils down to 
deciding whether all the rectangles can be packed into a bounding box of dimensions 
W X H. However, edges in the supporting graph impose additional constraints on the 
representation, which might make the Area-WRAC problem easier. The following 
theorem (proof is in the Appendix) shows that this is not the case. 

Theorem 9. Area-WRAC is NP-hard, even if the supporting graph is a path.

6 Experimental Results 

We implemented the algorithm from Corollary 1 for planar graphs (referred to as Pla- 
nar) and compared it with the algorithm from [4] (referred to as CPDWCV). Our 
data set is 120 Wikipedia documents, with 400 words or more. For the word clouds 
we chose the 100 most frequent words (after removing stop-words, e.g., "and", "the", 
"of"), and constructed supporting graph G with 100 vertices. Details are provided in 
the Appendix. 

We compare the percentage of realized profit in the representation of G for the
two algorithms. Since PLANAR handles planar supporting graphs, we first extract a
maximal planar subgraph $G_{\mathrm{planar}}$ of G, and then apply the algorithm to $G_{\mathrm{planar}}$. For
CPDWCV we compute the results for the graph G. The percentage of realized profit is
presented in the table. Our results indicate that, in terms of the realized profit, PLANAR
performs significantly better than the heuristic CPDWCV. Although we only prove a
$\frac{1}{6}(1 - \frac{1}{e}) \approx 0.1054$-approximation for planar graphs (Corollary 1 in combination with
Theorem 5), in practice PLANAR realizes more than 25% of the total profit of planar
graphs.

Algorithm    Realized Profit of G    Realized Profit of $G_{\mathrm{planar}}$

PLANAR       8.56%                   27.48%

CPDWCV       0.77%                   -
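
As a small illustration of the measure reported in the table, the sketch below computes the percentage of realized profit for a set of realized contacts and evaluates the proven guarantee numerically. The function name, the edge encoding via `frozenset`, and the toy numbers are assumptions for illustration; the guarantee is read here as (1/6)(1 − 1/e), which agrees with the stated value 0.1054.

```python
import math

def realized_profit_percent(profits, realized):
    """Percentage of total edge profit gained by the realized contacts.

    profits  -- dict mapping an edge (frozenset of two words) to its profit
    realized -- set of edges whose rectangles touch in the layout
    """
    total = sum(profits.values())
    gained = sum(p for edge, p in profits.items() if edge in realized)
    return 100.0 * gained / total

# Toy supporting graph (hypothetical data, not from the experiments).
profits = {
    frozenset({"word", "cloud"}): 3,
    frozenset({"cloud", "layout"}): 1,
}
realized = {frozenset({"word", "cloud"})}
percent = realized_profit_percent(profits, realized)

# Numerical value of the proven bound, read as (1/6) * (1 - 1/e).
guarantee = (1 - 1 / math.e) / 6
```

On the toy input three of the four profit units are realized, so `percent` is 75.0.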



7 Conclusions and Future Work 

We formulated the Word Rectangle Adjacency Contact (WRAC) problem, motivated 
by the desire to provide theoretical guarantees for semantic-preserving word cloud vi- 
sualization. We described efficient polynomial-time algorithms for variants of WRAC, 
showed that some variants are NP-complete, and described several approximation
algorithms. A natural open problem is to find an approximation algorithm for general graphs
with arbitrary profits. 

Acknowledgements: Work on this problem began at Dagstuhl Seminar 12261. We
thank the organizers, participants, and especially Steve Chaplick, Sara Fabrikant, Anna

Appendix

Proof (of Theorem 2). Let G be the supporting, quasi-triangulated graph. We
consider G embedded in the plane with outer face $\{v_N, v_E, v_S, v_W\}$. Note that this
embedding is unique. Abusing notation, we refer to a vertex and its corresponding box
with the same letter.

We begin by placing a horizontal and a vertical ray emerging from the same point
in positive x-direction and positive y-direction, respectively. For the first phase of the
algorithm let us pretend that the horizontal ray is the box $v_S$ (imagine a rectangle with
tiny height and huge width) and the vertical ray is the box $v_W$ (imagine a rectangle with
tiny width and huge height), regardless of what the actual boxes look like; see Fig. 6.

Fig. 6: Left: starting configuration with rays $v_S$ and $v_W$. Center: representation at an
intermediate step: vertex w fits into concavity p and is applicable, vertex v fits into
concavity s but is not applicable. Adding box w to the representation introduces new
concavity q, and the vertex at concavity r may become applicable. Right: there is no
applicable vertex and the algorithm terminates.

We build up a representation by adding one rectangle at a time. At every inter- 
mediate step the representation is rectilinear convex, that is, its intersection with any 
horizontal or vertical line is connected. In other words, the representation has no holes 
and a "staircase shape". We maintain the set of all concavities, that is, points on the 
boundary of the representation, which are bottom-right or top-left corners of some rect- 
angle but not a top-right corner of any rectangle. Initially there is only one concavity, 
namely the point where the rays $v_W$ and $v_S$ meet.
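
The invariant "rectilinear convex" can be tested directly on small examples. The sketch below is only an illustration under simplifying assumptions: rectangles have integer coordinates, the two rays are ignored, and the function name `is_rectilinear_convex` is hypothetical. It rasterizes the union into unit cells and checks that the occupied cells of every row and every column form one contiguous run.

```python
def is_rectilinear_convex(boxes):
    """boxes: list of (x0, y0, x1, y1) with integer corners, x0 < x1, y0 < y1.

    Returns True if every horizontal and every vertical line through the
    unit-cell grid meets the union of the boxes in a single interval.
    """
    cells = set()
    for x0, y0, x1, y1 in boxes:
        for x in range(x0, x1):
            for y in range(y0, y1):
                cells.add((x, y))          # unit cell with lower-left corner (x, y)

    rows, cols = {}, {}
    for x, y in cells:
        rows.setdefault(y, []).append(x)
        cols.setdefault(x, []).append(y)

    def contiguous(values):
        # Values are distinct, so a gap-free run has max - min + 1 elements.
        return max(values) - min(values) + 1 == len(values)

    return all(contiguous(v) for v in rows.values()) and \
           all(contiguous(v) for v in cols.values())
```

A staircase such as `[(0, 0, 3, 1), (0, 1, 2, 2), (0, 2, 1, 3)]` passes the test, whereas a U-shaped union does not, because a horizontal line through the opening meets it in two pieces.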

Each concavity p is a point on the boundary of two rectangles, say u and v. Since
G has no separating triangles, there are exactly two vertices that are adjacent to both
u and v, or only one if $\{u, v\} = \{v_S, v_W\}$. For exactly one of these vertices, call
it w, the rectangle is not yet placed because its bottom-left corner is supposed to be
placed on the concavity p. We say that w fits into the concavity p. We call a vertex w
applicable to an intermediate representation if it fits into some concavity and adding
the rectangle w gives a representation that is rectilinear convex. In the very beginning
the unique common neighbor of $v_S$ and $v_W$ is applicable.

The algorithm proceeds in n − 4 steps as follows. At each step we identify an inner
vertex w of G that is applicable to the current representation. We add the rectangle w
to the representation and update the set of concavities and applicable vertices. At most
two points have to be added to the set of concavities, while one is removed from this
set. The vertices that fit into the new concavities can easily be read off from the plane
embedding of G. Checking whether these vertices are applicable is easy. If the top-left
or bottom-right corner of w does not define a concavity, then one has to check whether
the vertices that fit into existing concavities to the left or below, respectively, are now
applicable. So each step can be done in constant time.

If the algorithm has placed the last inner vertex, it suffices to check whether the
representation without the two rays is a rectangle, that is, whether there are exactly
two concavities left. If so, we call this rectangle R and check whether the width of R is at
most the width of $v_N$ and $v_S$ and whether the height of R is at most the height of $v_E$
and $v_W$. If this holds true, we can easily place the rectangles $v_N, v_E, v_S, v_W$ to get a
representation that realizes G. The total running time is linear.

On the other hand, if the algorithm stops because there is no applicable vertex, or 
the height/width-conditions in the end phase are not met, then there is no representation 
that realizes G. This is due to the lack of choice in building the representation - if a 
vertex v is applicable to a concavity p then the bottom-left corner of v has to be placed 
at p in order to establish the contacts of v with the two rectangles containing p. □

Proof (of Theorem 4). We use a reduction from KNAPSACK, which is defined as
follows. Given a set of n items, each with a positive weight $w_i$ and a positive
profit $p_i$, $i = 1, \dots, n$, a knapsack with some positive capacity C, and a positive number
P, the task is to find a subset of items whose sum of weights does not exceed C and
whose sum of profits is at least P. This classical problem is known to be weakly NP-
complete.

The reduction is similar to the one presented in the proof of Theorem 1. Given
an instance $I = ((w_1, p_1), \dots, (w_n, p_n), C, P)$ of KNAPSACK we define an edge-
weighted star $S_I$ on n + 6 vertices as follows. There is a vertex $v_i$ for each $i = 1, \dots, n$,
a vertex c, and five vertices $a_1, a_2, a_3, a_4, a_5$. Vertex c is the center of the star $S_I$; its
edge to $v_i$ has weight $p_i$ for $i = 1, \dots, n$, and its edge to $a_k$ has weight $\sum_{i=1}^{n} p_i$ for
$k = 1, 2, 3, 4, 5$; see Fig. 7.

Fig. 7: Left: edge-weighted star $S_I$, defined from instance I of the KNAPSACK problem.
Right: optimal solution to Max-WRAC for $S_I$.

As before, we use $v \to (h, w)$ to define the box of v with height h and width w. We
define $v_i \to (1, w_i)$ for $i = 1, \dots, n$, $a_k \to (C, C)$ for $k = 1, 2, 3, 4, 5$, and $c \to (1, C)$.
Finally, we define the target profit in the Max-WRAC problem as $P_I = 5 \sum_{i=1}^{n} p_i + P$.

We claim that an instance I of the KNAPSACK problem is feasible if and only if
the instance of the Max-WRAC problem corresponding to $S_I$ is feasible. From any
solution of the Max-WRAC problem we can read off a solution for the KNAPSACK
problem.

First note that every solution of the Max-WRAC problem has total profit strictly
more than $5 \sum_{i=1}^{n} p_i$. Thus all adjacencies between c and $a_k$ for $k = 1, 2, 3, 4, 5$ are
realized and each $a_k$ contains a corner of c. It follows that at least three sides of c are
partially covered by some $a_k$ and at least one horizontal side of c is completely covered
by some $a_k$. Because c has height 1, none of the boxes $v_1, \dots, v_n$ (each of height 1)
touches c on the side. Hence each $v_i$ touches c (if at all) on a horizontal side, say the
bottom; see Fig. 7.

Now the bottom side of c has width C and each box $v_i$ has width $w_i$, $i = 1, \dots, n$.
Thus the subset $J \subseteq \{1, \dots, n\}$ of indices of boxes that touch c satisfies
$\sum_{j \in J} w_j \le C$. Moreover, the total profit of the representation is
$5 \sum_{i=1}^{n} p_i + \sum_{j \in J} p_j$, which is at least $P_I$ if and only if
$\sum_{j \in J} p_j \ge P$. That is, the items with indices in J are a solution
of the KNAPSACK problem.

Along the same lines, we can construct a solution for the Max-WRAC problem 
based on any solution of the KNAPSACK problem, and this concludes the proof. □
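
The construction used in this reduction can be written down concretely. The sketch below is an illustration only; the function names and the string labels for vertices are hypothetical, and the profit assigned to each edge between c and $a_k$ (the sum of all item profits) is the value implied by the target profit and the profit accounting in the proof.

```python
def knapsack_to_max_wrac(weights, profits, C, P):
    """Build the star instance S_I from a KNAPSACK instance.

    Returns (boxes, edge_profit, target), where boxes maps a vertex name to
    its (height, width) and edge_profit maps a star edge to its profit.
    """
    total = sum(profits)
    boxes = {"c": (1, C)}                      # center box: height 1, width C
    edge_profit = {}
    for i, (w, p) in enumerate(zip(weights, profits), start=1):
        boxes[f"v{i}"] = (1, w)                # item i: height 1, width w_i
        edge_profit[("c", f"v{i}")] = p
    for k in range(1, 6):
        boxes[f"a{k}"] = (C, C)                # five large blocking squares
        edge_profit[("c", f"a{k}")] = total    # assumed weight: sum of all p_i
    target = 5 * total + P                     # target profit P_I
    return boxes, edge_profit, target

def profit_of_subset(weights, profits, C, J):
    """Profit of the layout in which a_1..a_5 and the items in J touch c.

    J holds 0-based item indices. Returns None if the widths do not fit
    along one horizontal side of c.
    """
    if sum(weights[j] for j in J) > C:
        return None
    return 5 * sum(profits) + sum(profits[j] for j in J)
```

For example, with weights (2, 3, 4), profits (3, 4, 5), C = 5 and P = 7, the target is 67, and the subset consisting of the first two items reaches exactly this profit.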

Proof (of Theorem 9). We use a reduction from Strip Packing, so fix any instance I of Strip Packing
consisting of rectangles $r_1, \dots, r_n$ and two integers H and W. Let $d = \varepsilon / \max\{W, H\}$ for
some $\varepsilon \in (0, 1)$.

We define an instance of the Area-WRAC problem by slightly increasing the
heights and widths in I. The idea is to lay a unit square grid over the strip and blow
each grid line up to have a thickness of d; see Fig. 8. Each rectangle in I is stretched
according to the number of grid lines it intersects.




Fig. 8: Grid before and after stretching 

More precisely, we define for $i = 1, \dots, n$ a rectangle $r'_i$ of width $w(r_i) + (w(r_i) - 1)d$
and height $h(r_i) + (h(r_i) - 1)d$. Further we define $W' = W + (W - 1)d$ and
$H' = H + (H - 1)d$. Finally, we arrange the rectangles $r'_1, \dots, r'_n$ into a path P by
introducing between $r'_i$ and $r'_{i+1}$ ($i = 1, \dots, n - 1$), as well as before $r'_1$, k small $x \times x$
squares, called connector squares. We choose k and x to satisfy

$$kx \ge 4(n + 3)(H' + 2nW') \qquad (6)$$

and

$$n(kx^2 + 2x) = d. \qquad (7)$$

In particular, we choose

$$k = \frac{2n(2H'n + 6H' + 4n^2W' + 12nW' + 1) \cdot 4(n + 3)(H' + 2nW')}{d}$$

and

$$x = \frac{d}{2n(2H'n + 6H' + 4n^2W' + 12nW' + 1)}.$$

We claim that there is a representation realizing P within the $W' \times H'$ bounding box
if and only if the original rectangles $r_1, \dots, r_n$ can be packed into the original $W \times H$
bounding box.
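
Conditions (6) and (7) on k and x can be sanity-checked with exact rational arithmetic. The sketch below is an illustration under stated assumptions: it takes (6) with equality, derives x from (7), ignores the requirement that k be an integer, and uses $d = \varepsilon / \max\{W, H\}$ as one admissible choice of d. The function name and the toy numbers are hypothetical.

```python
from fractions import Fraction

def connector_parameters(n, W_prime, H_prime, d):
    """Return (k, x) with k*x = 4(n+3)(H'+2nW') and n(k x^2 + 2x) = d."""
    L = 4 * (n + 3) * (H_prime + 2 * n * W_prime)   # right-hand side of (6)
    # With k*x = L, condition (7) reads n*x*(L + 2) = d, which fixes x.
    x = d / (n * (L + 2))
    k = L / x                                       # integrality is not enforced
    return k, x

# Toy instance (hypothetical): n = 3 rectangles, W = 4, H = 5, eps = 1/2.
n, W, H = 3, 4, 5
d = Fraction(1, 2) / max(W, H)
W_prime = W + (W - 1) * d
H_prime = H + (H - 1) * d
k, x = connector_parameters(n, W_prime, H_prime, d)
```

Both identities then hold exactly, and n(L + 2) coincides with the denominator $2n(2H'n + 6H' + 4n^2W' + 12nW' + 1)$ of x.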

First consider any representation realizing P within the $W' \times H'$ bounding box and
remove all connector squares from it. Since $W' < W + \varepsilon < W + 1$ and $H' < H + \varepsilon < H + 1$,
the stretched bounding box has the same number of grid lines as the original.
Hence the rectangles $r'_1, \dots, r'_n$ can be replaced by the corresponding rectangles
$r_1, \dots, r_n$ and perturbed slightly such that every corner lies on a grid point. This way
we obtain a solution for the original instance of Strip Packing.
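
The stretching step and the bound on the stretched bounding box can be illustrated numerically. The sketch below is not the authors' code; it assumes $d = \varepsilon / \max\{W, H\}$, which is one choice that guarantees $W' < W + \varepsilon$ and $H' < H + \varepsilon$, and the function names are hypothetical.

```python
import math
from fractions import Fraction

def stretch(length, d):
    """A side of integer length crosses (length - 1) grid lines in its
    interior, and each of them is blown up to thickness d."""
    return length + (length - 1) * d

def stretched_instance(rects, W, H, eps):
    """Return the stretched rectangles together with W' and H'."""
    d = Fraction(eps) / max(W, H)            # assumed choice of d
    stretched = [(stretch(w, d), stretch(h, d)) for w, h in rects]
    return stretched, stretch(W, d), stretch(H, d)

# Toy instance (hypothetical numbers): rectangles given as (width, height).
rects = [(2, 3), (1, 1), (4, 2)]
stretched, W_prime, H_prime = stretched_instance(rects, W=4, H=5, eps=Fraction(1, 2))
```

Here $W' = 43/10$ and $H' = 27/5$, so rounding down recovers the original W = 4 and H = 5, and a unit square is left unchanged by the stretching.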

Now consider any solution for the Strip Packing instance, i.e., any packing of
the rectangles $r_1, \dots, r_n$ within the $W \times H$ bounding box. We will construct a
representation realizing the path P within the $W' \times H'$ bounding box. We start by blowing up
the grid lines of the $W \times H$ bounding box to thickness d each, which also affects all
rectangles intersected by a grid line in their interior. This way we obtain a placement of
bigger rectangles $r'_1, \dots, r'_n$ in the bigger $W' \times H'$ bounding box, such that every
rectangle $r'_i$ intersects the interiors of exactly those blown-up grid lines corresponding
to the grid lines that intersect $r_i$ interiorly. Thus any two rectangles $r'_i$ and $r'_j$ are
separated by a vertical or horizontal corridor of thickness at least d. We will refer to the grid
lines of thickness d as gaps.

It remains to place all the connector squares so as to realize the path P. The idea
is the following. We start in the lower-left corner of the bounding box, and lay out
connector squares horizontally to the right inside the bottommost horizontal gap until
we reach the vertical gap that contains the lower-left corner of $r'_1$. We then start laying
out the connector squares inside this vertical gap upwards, until we reach the lower-left
corner of $r'_1$. Whenever a rectangle $r'_j$ overlaps with this vertical gap, we go around $r'_j$
as illustrated in Fig. 9d. This way we lay out at most $(3W' + H')/x$ connector squares,
which by (6) is less than k. The remaining connector squares are "folded up" inside the
vertical gap; see Fig. 9b.

Next we lay out the connector squares between $r'_1$ and $r'_2$. We start where we ended
before, i.e., at the lower-left corner of $r'_1$, and go along the path we took before until
we reach the bottommost gap. Then we lay connector squares along the outermost gaps
in counterclockwise direction, i.e., first horizontally to the rightmost gap, then up to the
topmost gap, left to the leftmost gap, and down to the bottommost gap. Now we do the
same for $r'_2$ as we did for $r'_1$. If while going right we "hit" the connector squares
going up to $r'_1$, we follow them up, go around $r'_1$, and go down again. This is possible
since there are gaps all around $r'_1$; see Fig. 9a. Note that the red line of connectors will
actually sit on the dashed, expanded grid lines but is drawn next to them for better
readability.

We repeat this for all the rectangles. 

We have to show two things: first, that the number of connector squares between $r'_i$ and
$r'_{i+1}$ is large enough, so that the string of connectors is sufficiently long; and second, that
the gaps have sufficient space so that we can fold up the connectors in them.

The first condition is taken care of by equation (6). We divide the path of the
connectors into up to n + 3 parts. The first part $p_{\mathrm{down},i}$ goes down from $r'_i$ to the bottom
gap. The second part $p_{\mathrm{circle}}$ goes around the bounding box in counterclockwise
order to the vertical gap containing the lower-left corner of $r'_{i+1}$. This part is interrupted
by up to n parts $p_{\mathrm{avoid},k}$ where we hit a string of connectors going up to another rectangle
$r'_k$ and have to follow it, go around $r'_k$, and come down again. The last part $p_{\mathrm{up},i+1}$
goes up from the bottom gap to the position of $r'_{i+1}$. We will now show that each of
these parts has a maximum length of $4(H' + 2nW')$.

Fig. 9: Illustrations for Theorem 9. (a) NP-hardness proof for Area-WRAC of paths.
(b) Folding of connector rectangles inside a gap. (c) Connectors before rerouting.
(d) Connectors after rerouting.

The parts $p_{\mathrm{up}}$ and $p_{\mathrm{down}}$ have to span the height $H'$ at most once, and may
encounter all other rectangles $r'_k$ at most once. Going around any such $r'_k$ means at most
traversing its width twice, which is at most $2W'$. Hence each of $p_{\mathrm{up}}$ and $p_{\mathrm{down}}$ has a
total length of at most $H' + 2nW' \le 4(H' + 2nW')$. Since every $p_{\mathrm{avoid},k}$ exactly follows
$p_{\mathrm{up},k}$, then surrounds $r'_k$ (which has maximum width $W'$ and maximum height $H'$),
and then follows $p_{\mathrm{down},k}$, it has a maximum length of $2(W' + H') + 2(H' + 2nW') \le
4(H' + 2nW')$. Finally, $p_{\mathrm{circle}}$ has a maximum length of $2H' + 2W' \le 4(H' + 2nW')$.

Thus, the total length of the path of connectors, comprised of n + 3 parts of length at
most $4(H' + 2nW')$ each, is at most $4(n + 3)(H' + 2nW')$. Equation (6) ensures that
our string of connectors has sufficient length.

The second condition is covered by equation (7). Consider Fig. 9b. If a string of
connectors just passes through a gap, it takes up exactly $1 \times x$ space. If it folds m
connector rectangles inside the gap, it takes $m \times x^2$ plus the "wasted" space (the red
shaded space in Fig. 9b). The wasted space can be at most $1 \times 2x$, and since every
string of connectors has k connector rectangles, the space taken up by those can be at
most $kx^2$; thus every string of connectors can take at most $kx^2 + 2x$ space in any given
gap. Since there are n such strings of connectors and every gap has dimensions $1 \times d$,
equation (7) ensures that the space in every gap is sufficient.

We showed that we can find a layout of the path that corresponds to the optimum 
packing of the rectangles, if such a packing exists within the desired bounding box. 
Thus, finding the most space-efficient layout for a path of rectangles is NP-hard. □

Implementation Details Here we provide some details regarding the implementation 
of the algorithms PLANAR and CPDWCV from Section 6. 

Before the algorithms are applied, the text is preprocessed using this workflow: 
The text is split into sentences, and the sentences are split into words using Apache 
OpenNLP. We then remove stop words, perform stemming on the words and group the 
words with the same stem. The similarity of words is computed using Latent Semantic 
Analysis based on the co-occurrence of the words within the same sentence. 
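
The preprocessing workflow can be sketched in a few lines. The code below is a simplified stand-in and not the implementation used in the experiments: it replaces Apache OpenNLP by regular expressions, uses a crude suffix-stripping stemmer, and counts raw sentence-level co-occurrences instead of computing Latent Semantic Analysis. All names and the stop-word list are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

STOP_WORDS = {"and", "the", "of", "a", "in", "to", "is"}   # tiny illustrative list

def naive_stem(word):
    """Crude stemmer: strip one common suffix if enough of the word remains."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def cooccurrence(text):
    """Return (sentence_freq, pair_counts) for the stems in text.

    sentence_freq[s]    -- number of sentences containing stem s
    pair_counts[(s, t)] -- number of sentences containing both s and t (s < t)
    """
    sentence_freq, pair_counts = Counter(), Counter()
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z]+", sentence.lower())
        stems = {naive_stem(w) for w in words if w not in STOP_WORDS}
        sentence_freq.update(stems)
        for s, t in combinations(sorted(stems), 2):
            pair_counts[(s, t)] += 1
    return sentence_freq, pair_counts

freq, pairs = cooccurrence("The cat chased the dog. Dogs and cats sleep.")
```

In this toy input the stems "cat" and "dog" appear together in both sentences, so their pair count is 2; such counts could serve as edge profits of a supporting graph.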

In the implementation of PLANAR, we use the $(\frac{e-1}{e} - \varepsilon)$-approximation from [8]
combined with an FPTAS for KNAPSACK to approximate the stars. In the implementation
of CPDWCV, we achieved the best results in our experiments with parameters
$K_r = 4000$ and $K_a = 25$. One of the results computed by our algorithm is given in
Fig. 10.

Fig. 10: A result of the PLANAR algorithm: star-based semantics-preserving visualization
of Obama's 2013 State of the Union Speech.